[fix](binlog) Fix missing binlog column index when converting TTabletSchema to TabletSchemaPB#64484
[fix](binlog) Fix missing binlog column index when converting TTabletSchema to TabletSchemaPB#64484heguanhui wants to merge 2 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
f707b1d to
ca44c72
Compare
|
run buildall |
TPC-H: Total hot run time: 29004 ms |
TPC-DS: Total hot run time: 168871 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
efa3cc3 to
4ef5c5d
Compare
|
/review |
|
run buildall |
TPC-H: Total hot run time: 29303 ms |
TPC-DS: Total hot run time: 169441 ms |
|
run p0 regression |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
4ef5c5d to
c955a8e
Compare
|
/review |
c955a8e to
3e8f036
Compare
|
run buildall |
|
/review |
|
review |
There was a problem hiding this comment.
Review result: no blocking issues found.
Critical checkpoint conclusions:
- Goal and proof: The PR addresses the missing row-binlog LSN/timestamp column indexes when converting TTabletSchema to TabletSchemaPB, and adds BE unit coverage for both new conversion and old-metadata fallback paths. Existing GroupRowsetWriterTest coverage should exercise the original row-binlog writer crash path once these metadata fields resolve correctly.
- Scope: The code change is small and focused on row-binlog schema metadata and the row-binlog timestamp placeholder write path.
- Concurrency/lifecycle: No new shared mutable state, locks, threads, or non-trivial lifetime ownership are introduced.
- Configuration/compatibility: No new config items. The TabletSchemaPB fallback by column name preserves compatibility with older metadata where these optional indexes remained -1.
- Parallel paths: TabletSchema serialization already persists the resolved indexes; cloud PB conversion copies the fields when present, and the fallback also helps deserialized old PBs.
- Conditional checks: The added timestamp nullable check has a concrete path from existing BE test-created row-binlog schemas while production FE schemas keep timestamp nullable; read-time TSO replacement already handles both nullable and non-nullable columns.
- Test coverage: Added BE unit tests cover conversion and fallback. I did not run BE unit tests because no built output/test binary was present in this review checkout.
- Test results: passed. No generated result files are involved.
- Observability: No new observability appears necessary for this metadata fix.
- Transaction/persistence/data correctness: Persisted row-binlog schema metadata now carries the indexes needed by RowBinlogSegmentWriter and row-binlog readers; old metadata remains recoverable by name lookup.
- Performance: Only small schema-construction scans are added, outside hot row processing paths.
User focus: no additional user-provided review focus was present.
| IColumn* ts_col_ptr = binlog_prefix_columns[2].get(); | ||
| auto* ts_nullable_column = assert_cast<ColumnNullable*>(ts_col_ptr); | ||
| ts_nullable_column->insert_many_defaults(num_rows); // NULL placeholder (value + null map) | ||
| auto* ts_nullable_column = check_and_get_column<ColumnNullable>(ts_col_ptr); |
There was a problem hiding this comment.
not necessary, for row binlog writer ts_col_ptr should always exist
There was a problem hiding this comment.
Thanks for the review.
You're right that timestamp column should always be nullable by design. However, during my testing after fixing the binlog column index issue, I still encountered a coredump at this assert_cast in GroupRowsetWriterTest.sub_writer_rollback.
The root cause is that in the unit test environment, the timestamp column is not wrapped as ColumnNullable when creating the Block via create_block_by_cids. This is because the test uses a simplified schema creation path that doesn't go through the full FE->BE schema conversion.
I agree this should be investigated separately. But to make the current PR clean and focused on the binlog column index issue, I will split this into two PRs:
- PR1 (this PR): Fix binlog column index in
init_schema_from_thrift - PR2 (follow-up): Investigate and fix the nullable type mismatch for timestamp column in unit tests
I'll revert the change in row_binlog_segment_writer.cpp from this PR and keep only the binlog column index fix. The timestamp column issue will be addressed in a separate PR after further investigation.
Thank you for pointing this out!
There was a problem hiding this comment.
not necessary, for row binlog writer ts_col_ptr should always exist
this fixed in pr #64495
| static const std::string SEQUENCE_COL = "__DORIS_SEQUENCE_COL__"; | ||
| static const std::string BINLOG_TIMESTAMP_COL = "__DORIS_BINLOG_TIMESTAMP__"; | ||
| static const std::string BINLOG_LSN_COL = "__DORIS_BINLOG_LSN__"; | ||
| static const std::string BINLOG_OP_COL = "__DORIS_BINLOG_OP__"; |
There was a problem hiding this comment.
this add for fallback test, now fallback test will remove, so I will remove this also.
3e8f036 to
dd75689
Compare
|
|
||
| tablet_schema_pb->set_num_short_key_columns(tablet_schema.short_key_column_count); | ||
| tablet_schema_pb->set_num_rows_per_row_block(config::default_num_rows_per_column_file_block); | ||
| tablet_schema_pb->set_sequence_col_idx(tablet_schema.sequence_col_idx); |
There was a problem hiding this comment.
could just like sequence_col, change to tablet_schema_pb->set_binlog_lsn_col_idx
and tablet_schema_pb->set_binlog_timestamp_col_idx from tablet_schema?
There was a problem hiding this comment.
Thanks for the suggestion. However, TTabletSchema currently doesn't have
binlog_lsn_col_idx and binlog_timestamp_col_idx fields. Adding them would
require modifying Thrift definitions and FE code, which is a larger change.
This PR focuses on fixing the coredump in BE UT with minimal changes.
Computing the indices by column name is consistent with how other special
columns (like delete_sign_idx) are handled when TTabletSchema doesn't
provide them directly.
If you prefer, I can create a follow-up PR to add these fields to
TTabletSchema and pass them from FE to BE. But for this PR, the dynamic
approach works and is safe.
What do you think?
There was a problem hiding this comment.
ok, just leave it for now
There was a problem hiding this comment.
Thanks for the approval!
| col_ordinal_to_unique_id[i] = i; | ||
| } | ||
|
|
||
| // 构造 SchemaCreateOptions |
dd75689 to
338a31a
Compare
|
run buildall |
|
/review |
TPC-H: Total hot run time: 29031 ms |
TPC-DS: Total hot run time: 168703 ms |
|
run p0 regression |
|
run external regression |
|
need to fix row_binlog_segment_writer.cpp:334 for not nullable column |
| } | ||
| col_ordinal++; | ||
| init_column_from_tcolumn(unique_id, tcolumn, column); | ||
| if (tcolumn.column_name == BINLOG_LSN_COL) { |
There was a problem hiding this comment.
- col_ordinal++;
init_column_from_tcolumn(unique_id, tcolumn, column);
if (column->is_bf_column()) {
has_bf_columns = true;
}
+ if (column->name() == BINLOG_LSN_COL) {
+ tablet_schema_pb->set_binlog_lsn_col_idx(col_ordinal);
+ } else if (column->name() == BINLOG_TIMESTAMP_COL) {
+ tablet_schema_pb->set_binlog_timestamp_col_idx(col_ordinal);
+ }
+ col_ordinal++;
+
i think it's better this way
There was a problem hiding this comment.
Thanks! Updated as suggested. The timestamp nullable cast fix will be in a separate PR .
There was a problem hiding this comment.
why fix it in another PR? it won't pass be ut without fixing timestamp nullable cast
There was a problem hiding this comment.
You're right. I'll combine both fixes into this PR:
- Fix binlog column index in init_schema_from_thrift
- Fix test helper to set nullable for binlog special columns
This way the UT will pass without modifying row_binlog_segment_writer.cpp.
There was a problem hiding this comment.
why fix it in another PR? it won't pass be ut without fixing timestamp nullable cast
Thanks for the suggestion. I've merged the test helper fix into this PR:
- Added
BINLOG_OP_COLtoutils.h - Set
is_allow_null = truefor binlog special columns inenable_row_binlog()
Now this PR includes both fixes:
- Fix binlog column index in
init_schema_from_thrift() - Fix test helper to create nullable binlog columns
The UT should pass without modifying row_binlog_segment_writer.cpp.
PTAL. Thanks!
338a31a to
a672379
Compare
…Schema to TabletSchemaPB
a672379 to
7e2c20b
Compare
|
run buildall |
TPC-H: Total hot run time: 28990 ms |
TPC-DS: Total hot run time: 169858 ms |
|
run buildall |
TPC-H: Total hot run time: 28792 ms |
TPC-DS: Total hot run time: 168805 ms |
What problem does this PR solve?
Issue Number: #64483
Problem Summary:
When running
GroupRowsetWriterTest.sub_writer_rollback, a coredump occurs:F20260613 19:13:27.347458 row_binlog_segment_writer.cpp:69] Check failed: lsn_col_id >= 0 binlog schema missing DORIS_BINLOG_LSN
Root Cause
TabletMeta::init_schema_from_thrift()does not setbinlog_lsn_col_idxandbinlog_timestamp_col_idxinTabletSchemaPBwhen converting fromTTabletSchema. As a result, these fields remain -1 after deserialization, causing the CHECK failure.Solution
Add logic in
init_schema_from_thrift()to:__DORIS_BINLOG_LSN__,__DORIS_BINLOG_TIMESTAMP__) by nameTabletSchemaPBAdditionally, fix a nullable type mismatch for the timestamp column in
_fill_binlog_columns()by usingcheck_and_get_columninstead ofassert_cast.Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)